feat(celery): Add task and worker lifecycle metrics#4439
Conversation
45f5df3 to
94a36e1
Compare
c6ae222 to
a136151
Compare
10d220d to
a83f1b4
Compare
Introduce Flower-compatible metrics for Celery task and worker events: - flower.events.total: counter for task-sent, task-received, task-started, task-succeeded, task-failed, task-retried, task-revoked - flower.task.runtime.seconds: histogram of task execution time - flower.worker.number.of.currently.executing.tasks: gauge of in-flight tasks per worker - flower.worker.online: gauge tracking worker online/offline status CeleryInstrumentor (child process) handles task tracing and task-level metrics. CeleryWorkerInstrumentor (main process) handles worker lifecycle signals (worker_ready, worker_shutdown) and main-process events (task_received, task_revoked). Prefetch-time metrics (task_prefetch_time_seconds, worker_prefetched_tasks) are intentionally omitted — Celery splits task_received and task_prerun across processes in prefork mode, making it impossible to compute the delta without external event consumption.
a83f1b4 to
2b0c2d5
Compare
MikeGoldsmith
left a comment
There was a problem hiding this comment.
Thanks for the detailed PR @diogosilva30. A few things before we dive into a full review:
Metric naming — using the flower.* namespace copies Flower's Prometheus metric names, but OTel instrumentations should follow OTel semantic conventions rather than replicate another tool's naming. There aren't Celery-specific semconv metrics yet, so the first step would be to open an issue in opentelemetry/semantic-conventions to propose the relevant metric names. We can then implement against those once they're agreed.
Please split this up — at 1600+ lines this is very hard to review with confidence. I'd suggest at minimum:
- Memory leak fix for
task_id_to_start_time(small, can merge quickly) - Housekeeping (type annotations, null guards, typed dataclasses)
- Task-level metrics
CeleryWorkerInstrumentor(new entry point deserves its own discussion)
New CeleryWorkerInstrumentor — adding a new entry point and instrumentor is a significant API addition. Worth discussing the design in the issue first before implementing.
|
Hey @MikeGoldsmith , thanks for detailed feedback. Agree with split-up I've created these two initial PRs:
Thanks also for the point on metric naming. I checked the existing messaging semantic convention metrics, and I think the safe path is to map only the Potential mapping for
For the remaining metrics, I do not think there is a good existing semantic convention fit today:
So the next course of action should be proposing a specific set of semantic conventions opentelemetry/semantic-conventions like it exists for RabbitMQ inside messaging (here)? |
|
Thanks for the PR! Just a heads-up: we no longer update Please add the appropriate changelog fragment for this change instead of editing |
emdneto
left a comment
There was a problem hiding this comment.
@diogosilva30 any pending action here?
|
This PR has been automatically marked as stale because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 days of this comment. |
|
This PR has been closed due to inactivity. Please reopen if you would like to continue working on it. |
Description
Implements a subset of the Prometheus metrics exposed by Celery Flower directly in the Celery instrumentation, so users who only need basic task/worker observability no longer need to run Flower as a separate service.
Fixes #3458
Type of change
Changes
Task-level metrics (in
CeleryInstrumentor)Metric tracking wired into the existing
prerun/postrun/failure/retryhandlers:flower_events_totalflower.events.totalflower_task_runtime_secondsflower.task.runtime.secondsflower_worker_number_of_currently_executing_tasksflower.worker.number.of.currently.executing.tasksWorker lifecycle metrics (new
CeleryWorkerInstrumentor)A separate instrumentor that hooks into
worker_ready/worker_shutdown/task_received/task_revokedsignals in the main worker process:flower_events_totalflower.events.totalflower_worker_onlineflower.worker.onlineRegistered as a new
celery_workerentry point inpyproject.toml.Omitted metrics
flower_task_prefetch_time_secondsandflower_worker_prefetched_tasksare not implemented. Inpool=preforkmode (the default),task_receivedfires in the main process whiletask_prerunfires in a child process — there is no shared state to compute the time delta between them. Flower can do this because it consumes Celery events externally in a single process; replicating that inside the worker is not feasible without producer-side instrumentation or protocol changes.Bug fix — memory leak in
task_id_to_start_timePreviously,
task_id_to_start_timewas a class-level dict that was never cleaned up after task completion, growing unbounded over time. It is now instance-scoped and entries are removed in_trace_postrun.Housekeeping
__init__.pyandutils.pyCeleryGetterproperly (Getter[Request])detach_context/retrieve_context/retrieve_task_id_from_messageinutils.pydictmetrics store with typed_CeleryTaskMetrics/_CeleryWorkerMetricsdataclassesHow Has This Been Tested?
test_metrics.py, ~1100 new lines)worker_ready/worker_shutdownsignalsTrying it out
A self-contained example project (Django + Celery + Docker Compose + Grafana LGTM) is available at
celery-metrics-example(separate branch on the fork, not part of this PR).Quick start:
Then open Grafana at http://localhost:3000 (
admin/admin) and explore metrics prefixed withflower.*.Does This PR Require a Core Repo Change?
Checklist: